Use numbagg for ffill by default #8389

Conversation
Where does this happen? We should be able to switch between bottleneck and numbagg here: xarray/core/duck_array_ops.py (line 691, at f63ede9)

and here: xarray/core/dask_array_ops.py (line 58, at f63ede9)

then you'll get dask-aware ffill for free :)
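A minimal sketch of the kind of switch being suggested, in plain numpy. This is not xarray's actual code in duck_array_ops.py — the function names and fallback chain are assumptions for illustration:

```python
import numpy as np


def _ffill_numpy(arr):
    """Pure-numpy forward fill along the last axis (fallback path)."""
    # Index of the most recent non-NaN position, carried forward.
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[-1]), 0)
    np.maximum.accumulate(idx, axis=-1, out=idx)
    return np.take_along_axis(arr, idx, axis=-1)


def ffill(arr, axis=-1):
    """Hypothetical dispatch: prefer numbagg, then bottleneck, then numpy."""
    arr = np.moveaxis(np.asarray(arr, dtype=float), axis, -1)
    try:
        import numbagg  # parallel, handles N-d arrays directly

        out = numbagg.ffill(arr, axis=-1)
    except ImportError:
        try:
            import bottleneck

            out = bottleneck.push(arr, axis=-1)
        except ImportError:
            out = _ffill_numpy(arr)
    return np.moveaxis(out, -1, axis)
```

All three branches share the same semantics: NaNs are replaced by the last valid value along the axis, and leading NaNs stay NaN.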
Super weird — I have a stack trace from
It's a sampling profiler, so it misses frames. But after 10 mins of looking, I can't see how that would possibly happen — I can't see where this would get called. Though IIRC bottleneck doesn't deal well with arrays with lots of dimensions, and that array has lots of dimensions; so I think it's still possible that something is unstacking to reduce the dimensions, even if I can't find it atm. Anyway!!
OK great!
I refactored this to use the above. I need to work through the version checks...
Unfortunately I'm seeing some segfaults when running with:
This would not be a great experience. So I'll look at these before we consider merging; I suspect it'll be something upstream...
OK, this is all done I think:
Just some nits
     raise ImportError(
         "numbagg >= 0.2.1 is required for rolling_exp but currently numbagg is not installed"
     )
-    elif _NUMBAGG_VERSION < Version("0.2.1"):
+    elif pycompat.mod_version("numbagg") < Version("0.2.1"):
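The `pycompat.mod_version`-style check in the diff can be approximated with the stdlib's `importlib.metadata` plus `packaging` — a sketch under those assumptions, not xarray's actual `pycompat` module:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def mod_version(mod_name: str) -> Version:
    """Installed version of a distribution, as a comparable Version object."""
    return Version(version(mod_name))


def require_numbagg(minimum: str = "0.2.1") -> None:
    """Mirror the check in the diff: missing or too-old numbagg raises."""
    try:
        installed = mod_version("numbagg")
    except PackageNotFoundError:
        raise ImportError(
            f"numbagg >= {minimum} is required for rolling_exp "
            "but currently numbagg is not installed"
        )
    if installed < Version(minimum):
        raise ImportError(
            f"numbagg >= {minimum} is required for rolling_exp "
            f"but currently version {installed} is installed"
        )
```

Using `Version` comparison (rather than string comparison) avoids bugs like `"0.10" < "0.2"` evaluating True.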
shall we boost min supported numbagg version?
I think we basically keep the 12 month cycle — it's some good infra, and I would like to be a good citizen for that process rather than deviate to save myself some boilerplate...
doc/whats-new.rst (Outdated)

@@ -50,6 +50,10 @@ Documentation
 Internal Changes
 ~~~~~~~~~~~~~~~~

+- :py:meth:`DataArray.bfill` & :py:meth:`DataArray.ffill` now use numbagg by
+  default, which is up to 5x faster on wide arrays on multi-core machines. (:pull:`8339`)
what are "wide" arrays?
What's a better term? I agree it's not great now. A "wide" table has lots of columns that it can parallelize along. i.e. the benchmarks at https://github.com/numbagg/numbagg/#nd are 5x better than bottleneck, but just above, on a single column, the timing is identical, because the benefit comes from the parallelism.
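To make the "wide" point concrete: a forward fill along `time` is independent per column, so a `(time, columns)` array with many columns hands a parallel kernel many independent 1-D fills to spread across cores. A toy numpy illustration of the per-column independence (not numbagg's code):

```python
import numpy as np


def ffill_along_axis0(a):
    """Forward-fill down each column; every column is an independent task,
    which is exactly what a parallel kernel can split across cores."""
    idx = np.where(~np.isnan(a), np.arange(a.shape[0])[:, None], 0)
    np.maximum.accumulate(idx, axis=0, out=idx)
    return np.take_along_axis(a, idx, axis=0)


# A "wide" array: short along the fill axis, many independent columns.
rng = np.random.default_rng(0)
arr = rng.standard_normal((100, 10_000))
arr[rng.random(arr.shape) < 0.3] = np.nan  # punch ~30% holes
filled = ffill_along_axis0(arr)
```

On a single column there is nothing to parallelize, which matches the benchmark observation that the single-column timings are identical.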
Co-authored-by: Deepak Cherian <[email protected]>
* upstream/main:
  - Raise an informative error message when object array has mixed types (pydata#4700)
  - Start renaming `dims` to `dim` (pydata#8487)
  - Reduce redundancy between namedarray and variable tests (pydata#8405)
  - Fix Zarr region transpose (pydata#8484)
  - Refine rolling_exp error messages (pydata#8485)
  - Use numbagg for `ffill` by default (pydata#8389)
  - Fix bug for categorical pandas index with categories with EA dtype (pydata#8481)
  - Improve "variable not found" error message (pydata#8474)
  - Add whatsnew for pydata#8475 (pydata#8478)
  - Allow `rank` to run on dask arrays (pydata#8475)
  - Fix mypy tests (pydata#8476)
  - Use concise date format when plotting (pydata#8449)
  - Fix `map_blocks` docs' formatting (pydata#8464)
  - Consolidate `_get_alpha` func (pydata#8465)
The main perf advantage here is the array doesn't need to be unstacked & stacked, which is a huge win for large multi-dimensional arrays... (I actually was hitting a memory issue running an `ffill` on my own, and so thought I'd get this done!)

We could move these methods to `DataWithCoords`, since they're almost the same implementation between a `DataArray` & `Dataset`, and exactly the same for numbagg's implementation.

For transparency — the logic of "check for numbagg, check for bottleneck" I wouldn't rate at my most confident. But I'm more confident that just installing numbagg will work. And if that works well enough, we could consider only supporting numbagg for some of these in the future.
I also haven't done the benchmarks here — though the functions are relatively well benchmarked at numbagg. I'm somewhat trading off getting through these quickly (rolling functions are coming up too) vs. doing fewer, more carefully, and leaning towards the former, but feedback welcome...
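The stack/unstack cost mentioned above can be sketched: a kernel that only understands 2-D input forces a reshape round-trip (often materializing a copy of a large N-d array), whereas an N-d-aware kernel works along the requested axis directly. These are toy numpy helpers with hypothetical names, just to show where the extra memory comes from:

```python
import numpy as np


def ffill_2d(a):
    """A kernel that only handles 2-D input (rows = independent series)."""
    idx = np.where(~np.isnan(a), np.arange(a.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return np.take_along_axis(a, idx, axis=1)


def ffill_via_reshape(a, axis):
    """The 'stack/unstack' path: flatten all other dims, fill, restore.

    moveaxis + reshape generally copies a large non-contiguous N-d array,
    which is the memory overhead an N-d-aware kernel avoids."""
    moved = np.moveaxis(a, axis, -1)
    flat = moved.reshape(-1, moved.shape[-1])
    filled = ffill_2d(flat).reshape(moved.shape)
    return np.moveaxis(filled, -1, axis)
```

An N-d-aware implementation skips `ffill_via_reshape` entirely and fills in a single pass over the original layout.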